Abstract. While for deterministic systems a counterexample to a property can
simply be an error trace, counterexamples in probabilistic systems are necessarily
more complex. For instance, a set of erroneous traces with a sufficient cumulative
probability mass can be used. Since these objects are too large to understand and
manipulate, compact representations such as subchains have been considered. In the
case of probabilistic systems with non-determinism, the situation is even more complex.
While a subchain for a given strategy (or scheduler, resolving non-determinism)
is a straightforward choice, we take a different approach. Instead, we focus on the
strategy itself, which can be a counterexample to violation of or a witness of satisfaction
of a property. We extract the most important decisions it makes and present
its succinct representation.

The key tools we employ to achieve this are (1) introducing a concept of importance 
of a state w.r.t. the strategy, and (2) learning using decision trees. There are three 
main consequent advantages of our approach. Firstly, it exploits the quantitative 
information on states, stressing the more important decisions. Secondly, it leads to 
a greater variability and degree of freedom in representing the strategies. Thirdly, 
the representation uses a self-explanatory data structure. In summary, our approach 
produces more succinct and more explainable strategies, as opposed to e.g. binary 
decision diagrams. Finally, our experimental results show that we can extract several 
rules describing the strategy even for very large systems that do not fit in memory, 
and based on the rules explain the erroneous behaviour. 


1 Introduction 


The standard models for dynamic stochastic systems with both probabilistic and nondeterministic
behaviour are Markov decision processes (MDPs) [How60, Put94, FV97]. They are
widely used in verification of probabilistic systems [BK08, KNP11] in several ways. Firstly,
in concurrent probabilistic systems, such as communication protocols, the nondeterminism
arises from scheduling [CY95, Var85]. Secondly, probabilistic systems operating in open
environments, such as various stochastic reactive systems, respond to nondeterministic
inputs [Seg95, dA97]. Thirdly, for underspecified probabilistic systems, a controller is
synthesized, resolving the nondeterminism in a way that optimizes some objective, such as
energy consumption or time constraints in embedded systems [BK08, KNP11].

In analysis of MDPs, the behaviour under all possible strategies (schedulers, controllers,
policies) is examined. For example, in the first two cases, the result of the verification
process is either a guarantee that a given property holds under all strategies, or a
counterexample strategy. In the third case, either a witness strategy guaranteeing a given
property is synthesized, or its non-existence is stated. In all settings, it is desirable that
the output strategies be “small and understandable” in addition to being correct. Intuitively,
such a strategy has a representation small enough for the human debugger to read and understand
where the bug is (in the verification setting), or for the programmer to implement in the device
(in the synthesis setting). In this paper, we focus on the verification setting and illustrate
our approach mainly on probabilistic protocols. Nonetheless, our results immediately carry
over to the synthesis setting.

Obtaining a small and simple strategy may be impossible if the strategy is required to
be optimal, i.e., in our setting reaching the error state with the highest possible probability.
Therefore, there is a trade-off between simplicity and optimality of the strategies. However,
in order to debug a system, a simple counterexample or a series thereof is more valuable
than the most comprehensive, but incomprehensible counterexample. In practice, a simple
strategy reaching the error with probability smaller by a factor of $\varepsilon$, e.g. one per cent,
is a more valuable source of information than a huge description of an optimal strategy.
Similarly, controllers in embedded devices should be as optimal as possible, but only as
long as they are small enough to fit in the device. In summary, we are interested in finding
small and simple close-to-optimal ($\varepsilon$-optimal) strategies.

How can one obtain a small and simple strategy? This seems to require some understanding
of the particular system and the bug. How can we do something like that automatically?
The approaches have so far been limited to a BDD representation of the strategy, or to
generating subchains representing a subset of paths induced by the strategy. While BDDs
provide a succinct representation, they are not easy to read and understand. Further,
subchains do not focus on the decisions the strategy makes at all. In contrast, a huge effort
has been spent on methods to obtain “understanding” from large sets of data, using machine
learning methods. In this paper, we propose to extend their use in verification, namely of
reachability properties, in several ways. Our first aim of using these methods is to efficiently
exploit the structure that is present in the models, written in e.g. the PRISM language with
variables and commands. This structure gets lost in the traditional numerical analysis of the
MDPs generated from the PRISM language description. The second aim is to distil more
information from the generated MDPs, namely the importance of each decision. Both lead
to an improved understanding of the strategy’s decisions.

Our approach. We propose three steps to obtain the desired strategies. Each of them has a 
positive effect on the resulting size. We describe these three steps and illustrate them on an 
example. 

(1) Obtaining a (possibly partially defined and liberal) $\varepsilon$-optimal strategy. The $\varepsilon$-optimal
strategies produced by standard methods, such as value iteration of PRISM, may be too
large to compute and overly specific. Firstly, as argued in [BCC+14], typically only a small
fraction of the system needs to be explored in order to find an $\varepsilon$-optimal strategy, whereas
most states are reached with only a very small probability. Without much loss, the strategy
may not be defined there. For example, in the MDP M depicted in Fig. 1, the decision in
$q$ (and the $v_i$'s) is almost irrelevant for the overall probability of reaching $t$ from $s$. Such a
partially defined strategy can be obtained using learning methods [BCC+14].

Fig. 1: An MDP M with reachability objective t

Secondly, while the usual strategies prescribe which action to play, liberal strategies
leave more choices open. There are several advantages of liberal strategies,
and similar notions of strategies called permissive strategies have been studied
in [BJW02, BMOU11, DFK+14]. A liberal strategy, instead of choosing an action in each
state, chooses a set of actions to be played uniformly at every state. First, each liberal
strategy represents a set of strategies, and thus covers more behaviour. Second, in
counter-example guided abstraction-refinement (CEGAR) analysis, since liberal strategies can
represent sets of counter-examples, they accelerate the abstraction-refinement loop by ruling
out several counter-examples at once. Finally, they also allow for more robust learning
of smaller strategies in Step 3. We show that such strategies can be obtained from standard
value iteration [KP13] as well as [BCC+14]. Further processing of the strategies in Steps 2
and 3 allows liberal strategies as input and preserves liberty in the small representation of
the strategy.

(2) Identifying important parts of the strategy. We define a concept of importance of a
state w.r.t. a strategy, corresponding to the probability of visiting the state by the strategy.
Observe that only a fraction of states can be reached while following the strategy, and thus
have positive importance. On the unreachable states, with zero importance, the definition
of the strategy is completely useless. For instance, in M, both states $p$ and $q$ must have
been explored when constructing the strategy in order to find out whether it is better to take
action $a$ or $b$. However, if the resulting strategy is to use $b$ and $d$, the information on what to do
in the $u_i$'s is useless. In addition, we consider the $v_i$'s to be of zero importance, too, since they are
never reached on the way to the target.

Furthermore, apart from ignoring states with zero importance, we want to partially 
ignore decisions that are unlikely to be made (in less important states such as q), and in 
contrast, stress more the decisions in important states likely to be visited (such as s). Note 
that this is difficult to achieve in data structures that remember all the stored data exactly, 
such as BDDs. Of course, we can store decisions in states with importance above a certain
threshold. However, we obtain a much smaller representation if we allow more variability
and reflect the whole quantitative information, as shown in Step 3.

(3) Data structures for compact representation of strategies. The explicit representation 
of a strategy by a table of pairs (state, action to play) results in a huge amount of data 
since the systems often have millions of states. Therefore, a symbolic representation by
binary decision diagrams (BDDs) looks like a reasonable option. However, there are several
drawbacks of using BDDs. Firstly, due to the bit-level representation of the state-action
pairs, the resulting BDD is not very readable. Secondly, it is often still too large to be
understood by a human, for instance due to a bad ordering of the variables. Thirdly, it cannot
quantitatively reflect the differences in the importance of states.

Therefore, we propose to use decision trees instead, see e.g. [Mit97], a structure similar
to BDDs, but with nodes labelled by various predicates over the system’s variables. They
have several advantages. Firstly, they yield an explanation of the decision, as opposed to
e.g. neural networks, and thus provide an explanation of how the strategy works. Secondly,
sophisticated algorithms for their construction, based on entropy, result in smaller
representations than BDDs, where a good ordering of variables is known to be notoriously difficult to
find [BK08]. Thirdly, as suggested in Step 2, they allow less important data to be remembered
with lower probability if this sufficiently simplifies the tree and decreases its size. Finally, the
major drawback of decision trees in machine learning, namely frequent overfitting of the training
data, is not an issue in our setting since the tree is not used for classification of test data,
but only of the training data.

Summary of our contribution. In summary, our contributions are as follows:

- We provide a method for obtaining succinct representation of $\varepsilon$-optimal strategies as
decision trees. The method is based on a new concept of importance measure and on
well-established machine learning techniques.

- Experimental data show that even for some systems larger than the available memory,
our method yields trees with only several dozens of nodes.

- We illustrate the understandability of the representation on several examples from
PRISM benchmarks [KNP12], reading off respective bug explanations.


Related work. In artificial intelligence, compact (factored) representations of MDP structure
have been developed using dynamic Bayesian networks [BDG95, KHW94], probabilistic
STRIPS [KK99], algebraic decision diagrams [HSaHB99], and also decision
trees [BDG95]. Formalisms used to represent MDPs can, in principle, be used to represent
values and policies as well. In particular, variants of decision trees are probably
the most used [BDG95, CK91, KP99]. For a detailed survey of compact representations
see [BDH99]. In the context of verification, MDPs are often represented using variants of
(MT)BDDs [dAKN+00, HKN+03, MP04], and strategies by BDDs [WBB+10].


Decision trees have been used in connection with real-time dynamic programming and
reinforcement learning [BD96, Pye03]. Learning a compact decision tree representation of a
policy has been investigated in [SLT10] for the case of body sensor networks, but the paper
aims at a completely different application field (a simple model of sensor networks as
opposed to generic PRISM models), uses different objectives (discounted rewards), and
does not consider the importance of a state that, as we show, may substantially decrease
sizes of resulting policies.

Our results are related to the problem of computing minimal counterexamples in
probabilistic verification. Most papers concentrate on solving this problem for Markov
chains and linear-time properties [HKD09, ADvR08, WJA+14, JAK+11], branching-time
properties [DHK08, FHPW10, AL10], and in the context of simulation [KPC12]. A couple
of tools have been developed for probabilistic counterexample generation, namely
DiPro [ALFLS11] and COMICS [JAV+12]. For a detailed survey see [ABD+14].

Concerning MDPs, [WJA+14] uses mixed integer linear programming to compute
minimal critical sub-systems, i.e. whole sub-MDPs as opposed to a compact representation
of “right” decisions computed by our methods. [AL09] uses a directed on-the-fly search to
compute sets of most probable diagnostic paths (which somewhat resembles our notion of
importance), but the paths are encoded explicitly by AND/OR trees as opposed to our use of
decision trees. Neither of these papers takes advantage of an internal structure of states and
their methods substantially differ from ours. The notion of paths encoded as AND/OR trees
has also been studied in [LL13] to represent probabilistic counter-examples visually as fault
trees, and then derive causal (cause and effect) relationships between events. [KH09]
develops an abstraction-based framework for model-checking MDPs based on games, which
allows trading compactness for precision, but does not give a procedure for constructing
(a compact representation of) counterexample strategies. [DJW+14] computes a smallest
set of guarded commands (of a PRISM-like language) that induce a critical subsystem,
but, unlike our methods, does not provide a compact representation of actual decisions
needed to reach an erroneous state and is incomplete as there is not always a command-based
counterexample. Moreover, all previous works considered a single strategy, and none
of them considered computation and representation of liberal strategies.

Counter-examples play a crucial role in CEGAR analysis of MDPs, and have been
widely studied, for example in game-based abstraction refinement [KNP06]; the non-compositional
CEGAR approach for reachability [HWZ08] and safe-pCTL [CV10]; the compositional CEGAR
approach for safe-pCTL and qualitative logics [KPC12, CCD14]; and abstraction-refinement
for quantitative properties [DJJL01, DJJL02]. All of these works only consider a
single strategy represented explicitly, whereas our approach considers a succinct
representation of a set of strategies, and can accelerate the abstraction-refinement loop.

2 Preliminaries 

We use $\mathbb{N}$, $\mathbb{Q}$, and $\mathbb{R}$ to denote the sets of positive integers, rational and real numbers,
respectively. The set of all rational probability distributions over a finite set $X$ is denoted
by $\mathit{Dist}(X)$. Further, $d \in \mathit{Dist}(X)$ is Dirac if $d(x) = 1$ for some $x \in X$. Given a function
$f : X \to \mathbb{R}$, we write $\arg\max_{x \in X} f(x) = \{x \in X \mid f(x) = \max_{x' \in X} f(x')\}$.

Markov chains. A Markov chain is a tuple $M = (L, P, \mu)$ where $L$ is a finite set of
locations, $P : L \to \mathit{Dist}(L)$ is a probabilistic transition function, and $\mu \in \mathit{Dist}(L)$ is the
initial probability distribution.

A run in $M$ is an infinite sequence $\omega = \ell_1 \ell_2 \cdots$ of locations, a path in $M$ is a finite
prefix of a run. Each path $w$ in $M$ determines the set $\mathit{Cone}(w)$ consisting of all runs that
start with $w$. To $M$ we associate the probability space $(\mathit{Runs}, \mathcal{F}, \mathbb{P})$, where $\mathit{Runs}$ is the set
of all runs in $M$, $\mathcal{F}$ is the $\sigma$-field generated by all $\mathit{Cone}(w)$, and $\mathbb{P}$ is the unique probability
measure such that $\mathbb{P}[\mathit{Cone}(\ell_1 \cdots \ell_k)] = \mu(\ell_1) \cdot \prod_{i=1}^{k-1} P(\ell_i)(\ell_{i+1})$.

Markov decision processes. A Markov decision process (MDP) is a tuple $G =
(S, A, \mathit{Act}, \delta, \hat{s})$ where $S$ is a finite set of states, $A$ is a finite set of actions,
$\mathit{Act} : S \to 2^A \setminus \{\emptyset\}$ assigns to each state $s$ the set $\mathit{Act}(s)$ of actions enabled in $s$,
$\delta : S \times A \to \mathit{Dist}(S)$ is a probabilistic transition function that, given a state and an
action, gives a probability distribution over the successor states, and $\hat{s}$ is the initial state.

A run in $G$ is an infinite alternating sequence of states and actions $\omega = s_1 a_1 s_2 a_2 \cdots$
such that for all $i \ge 1$, we have $a_i \in \mathit{Act}(s_i)$ and $\delta(s_i, a_i)(s_{i+1}) > 0$. A path of length $k$
in $G$ is a finite prefix $w = s_1 a_1 \cdots a_{k-1} s_k$ of a run in $G$.

Strategies and plays. Intuitively, a strategy (or a policy) in an MDP $G$ is a “recipe” to
choose actions. Formally, a strategy is a function $\sigma : S \to \mathit{Dist}(A)$ that given the current
state of a play gives a probability distribution over the enabled actions.¹ In general, a
strategy may randomize, i.e. return non-Dirac distributions. A strategy is deterministic if it
gives a Dirac distribution for every argument.

1 In general, a strategy is a function $\sigma : (SA)^*S \to \mathit{Dist}(A)$ that given a finite path $w$, representing
the history of a play, gives a probability distribution over the actions enabled in the last state.
However, for objectives considered in this paper, these more general strategies are not more
powerful than our restricted memoryless strategies (depending on the last state visited). In order to
simplify the notation, we thus only consider memoryless strategies in this paper.

A play of $G$ determined by a strategy $\sigma$ and a state $s \in S$ is a Markov chain $G^\sigma_s$ where
the set of locations is $S$, the initial distribution $\mu$ is Dirac with $\mu(s) = 1$ and

$$P(s)(s') = \sum_{a \in A} \sigma(s)(a) \cdot \delta(s, a)(s').$$

The induced probability measure is denoted by $\mathbb{P}^\sigma_s$ and “almost surely” or “almost all runs”
refers to happening with probability 1 according to this measure. We usually write $\mathbb{P}^\sigma$
instead of $\mathbb{P}^\sigma_{\hat{s}}$ (here $\hat{s}$ is the initial state of $G$).

Liberal strategies. A liberal strategy is a function $\varsigma : S \to 2^A$ such that for every $s \in S$ we
have that $\emptyset \neq \varsigma(s) \subseteq \mathit{Act}(s)$. Given a liberal strategy $\varsigma$ and a state $s$, an action $a \in \mathit{Act}(s)$
is good (in $s$ w.r.t. $\varsigma$) if $a \in \varsigma(s)$, and bad otherwise. Abusing notation, we denote by $\varsigma$ the
strategy that to every state $s$ assigns the uniform distribution on $\varsigma(s)$ (which, in particular,
allows us to use $G^\varsigma_s$, $\mathbb{P}^\varsigma_s$ and apply the notion of $\varepsilon$-optimality to $\varsigma$).
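
As a small illustration of the two definitions above, the following Python sketch builds the
transition function of the play induced by a liberal strategy, read as the uniform distribution
on its good actions. The dictionary encoding of $\delta$, the function name `induced_chain`, and the
three-state example are assumptions made for illustration only; they are not taken from the paper.

```python
from fractions import Fraction


def induced_chain(delta, liberal):
    """Return P with P[s][t] = sum over a of sigma(s)(a) * delta[(s, a)][t],
    where sigma(s) is the uniform distribution on the good actions liberal[s]."""
    chain = {}
    for s, good in liberal.items():
        weight = Fraction(1, len(good))  # uniform choice among good actions
        row = {}
        for a in good:
            for t, p in delta[(s, a)].items():
                row[t] = row.get(t, 0) + weight * p
        chain[s] = row
    return chain


# Hypothetical MDP fragment: delta maps (state, action) to a distribution.
delta = {
    ("s", "a"): {"t": Fraction(1, 2), "u": Fraction(1, 2)},
    ("s", "b"): {"t": Fraction(1)},
    ("t", "a"): {"t": Fraction(1)},
    ("u", "a"): {"u": Fraction(1)},
}
# Both actions are good in s, so each is played with probability 1/2.
liberal = {"s": {"a", "b"}, "t": {"a"}, "u": {"a"}}
chain = induced_chain(delta, liberal)
```

Exact rationals are used here only so that the rows of the resulting chain sum to exactly one.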

Reachability objectives. Given a set $F \subseteq S$ of target states, we denote by $\Diamond F$ the set of
all runs that visit a state of $F$. For a state $s \in S$, the maximal reachability probability (or
simply value) in $s$ is $\mathit{Val}(s) := \max_\sigma \mathbb{P}^\sigma_s[\Diamond F]$. Given $\varepsilon \ge 0$, we say that a strategy $\sigma$ is
$\varepsilon$-optimal if $\mathbb{P}^\sigma[\Diamond F] \ge \mathit{Val}(\hat{s}) - \varepsilon$, and we call a 0-optimal strategy optimal.² To avoid
overly technical notation, we assume that states of $F$, subject to the reachability objective,
are absorbing, i.e. for all $s \in F$, $a \in \mathit{Act}(s)$ we have $\delta(s, a)(s) = 1$.

End components. A non-empty set $S' \subseteq S$ is an end component (EC) of $G$ if there is
$\mathit{Act}' : S' \to 2^A \setminus \{\emptyset\}$ such that (1) for all $s \in S'$ we have $\mathit{Act}'(s) \subseteq \mathit{Act}(s)$, (2) for all
$s \in S'$, we have $a \in \mathit{Act}'(s)$ iff $\delta(s, a) \in \mathit{Dist}(S')$, and (3) for all $s, t \in S'$ there is a path
$w = s_1 a_1 \cdots a_{k-1} s_k$ such that $s_1 = s$, $s_k = t$, and $s_i \in S'$, $a_i \in \mathit{Act}'(s_i)$ for every $i$. An
end component is a maximal end component (MEC) if it is maximal with respect to the
subset ordering. Given an MDP, the set of MECs is denoted by MEC. Given a MEC, actions
of $\mathit{Act}'(s)$ and $\mathit{Act}(s) \setminus \mathit{Act}'(s)$ are called internal and external (in $s$), respectively.

3 Computing $\varepsilon$-optimal Strategies

There are many algorithms for solving quantitative reachability in Markov decision processes,
such as the value iteration, the strategy improvement, linear programming based
methods etc., see [Put94]. The main method implemented in PRISM is the value iteration,
which successively (under)approximates the value $\mathit{Val}(s, a) := \sum_{s' \in S} \delta(s, a)(s') \cdot \mathit{Val}(s')$
of every state-action pair $(s, a)$ by a value $V(s, a)$, and stops when the approximation is
good enough. Denoting by $V(s) := \max_{a \in \mathit{Act}(s)} V(s, a)$, every step of the value iteration
improves the approximation $V(s, a)$ by assigning $V(s, a) := \sum_{s' \in S} \delta(s, a)(s') \cdot V(s')$
(we start with $V$ such that $V(s) = 1$ if $s \in F$, and $V(s) = 0$ otherwise).
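
The update rule above can be sketched in a few lines of Python. This is a minimal illustration,
not PRISM's implementation: the dictionary encoding, the function name `value_iteration`, and the
stopping rule (largest change below `tol`) are assumptions. In particular, such a stopping rule
is a common heuristic and does not by itself certify $\varepsilon$-optimality. Target states are assumed
to be absorbing, as in the text.

```python
def value_iteration(states, actions, delta, targets, tol=1e-9, max_iter=100_000):
    """Approximate V(s, a) from below, starting from V(s) = 1 on targets, 0 elsewhere."""
    V = {s: 1.0 if s in targets else 0.0 for s in states}
    Q = {}
    for _ in range(max_iter):
        # V(s, a) := sum over s' of delta(s, a)(s') * V(s')
        Q = {(s, a): sum(p * V[t] for t, p in delta[(s, a)].items())
             for s in states for a in actions[s]}
        # V(s) := max over enabled actions of V(s, a)
        new_V = {s: max(Q[(s, a)] for a in actions[s]) for s in states}
        change = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if change < tol:
            break
    return V, Q


# Hypothetical four-state MDP with target t and sink u.
states = ["s", "p", "t", "u"]
actions = {"s": ["a", "b"], "p": ["c"], "t": ["loop"], "u": ["loop"]}
delta = {
    ("s", "a"): {"p": 1.0},
    ("s", "b"): {"t": 0.5, "u": 0.5},
    ("p", "c"): {"t": 0.9, "u": 0.1},
    ("t", "loop"): {"t": 1.0},
    ("u", "loop"): {"u": 1.0},
}
targets = {"t"}
V, Q = value_iteration(states, actions, delta, targets)
```

In this example, action `a` in `s` is worth 0.9 and action `b` is worth 0.5, so `V["s"]` converges to 0.9.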

The disadvantage of the standard value iteration (and also most of the above mentioned
traditional methods) is that it works with the whole state space of the MDP (or at least
with its reachable part). For instance, consider states $u_i$, $v_i$ of Fig. 1. The paper [BCC+14]
adapts methods of bounded real-time dynamic programming (BRTDP, see e.g. [MLG05]) to
speed up the computation of the value iteration by improving $V(s, a)$³ only on “important”
state-action pairs identified by simulations.

2 For every MDP, there is a memoryless deterministic optimal strategy, see e.g. [Put94].

3 Here we use $V$ for the lower approximation denoted by $V_L$ in [BCC+14].


Even though RTDP methods may substantially reduce the size of an $\varepsilon$-optimal strategy,
its explicit representation is usually large and difficult to understand. Thus we develop
succinct representations of strategies, based on decision trees, that will reduce the size even
further and also provide a human readable representation. Even though the above methods
are capable of yielding deterministic $\varepsilon$-optimal strategies that can be immediately fed into
machine learning algorithms, we found it advantageous to give the learning algorithm more
freedom in the sense that if there are more $\varepsilon$-optimal strategies, we let the algorithm choose
(uniformly). This is especially useful within end components where many actions have the
same value. Therefore, we extract liberal $\varepsilon$-optimal strategies from the value approximation
$V$, output either by the value iteration or BRTDP.


Computing liberal $\varepsilon$-optimal strategies. Let us show how to obtain a liberal strategy $\varsigma$
from the value iteration, or BRTDP. For simplicity, we start with MDPs without MECs.

MDP without end components. Say that $V : S \times A \to [0, 1]$ is a valid $\varepsilon$-underapproximation
if the following conditions hold:

1. $V(s, a) \le \mathit{Val}(s, a)$ for all $s \in S$ and $a \in A$

2. $\mathit{Val}(\hat{s}) - V(\hat{s}) \le \varepsilon$

3. $V(s, a) \le \sum_{s' \in S} \delta(s, a)(s') \cdot V(s')$ for all $s \in S$ and $a \in \mathit{Act}(s)$

The outputs $V$ of both the value iteration and BRTDP are valid $\varepsilon$-underapproximations.
We define a liberal strategy $\varsigma^V$ by $\varsigma^V(s) = \arg\max_{a \in \mathit{Act}(s)} V(s, a)$ for all $s \in S$.

Lemma 1. For every $\varepsilon > 0$ and a valid $\varepsilon$-underapproximation $V$, $\varsigma^V$ is $\varepsilon$-optimal.⁴
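
For the MEC-free case, extracting the liberal strategy from $V$ amounts to collecting, in each
state, all enabled actions that attain the maximum. The Python sketch below illustrates this;
the function name `liberal_strategy` and the sample values are assumptions, and exact rationals
are used so that ties are detected reliably (with floating-point values, equal actions could be
missed).

```python
from fractions import Fraction


def liberal_strategy(actions, Q):
    """For every state s, return the set argmax over enabled a of Q[(s, a)]."""
    strategy = {}
    for s, enabled in actions.items():
        best = max(Q[(s, a)] for a in enabled)
        strategy[s] = {a for a in enabled if Q[(s, a)] == best}
    return strategy


# Hypothetical value approximation: actions a and b tie in state s.
actions = {"s": ["a", "b", "c"], "p": ["d"]}
Q = {
    ("s", "a"): Fraction(9, 10),
    ("s", "b"): Fraction(9, 10),
    ("s", "c"): Fraction(1, 2),
    ("p", "d"): Fraction(0),
}
strategy = liberal_strategy(actions, Q)
```

Note that the state `p` with the trivial approximation 0 still receives a non-empty set of good actions.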

General MDP. For MDPs with end components we have to extend the definition of the valid
$\varepsilon$-underapproximation. Given a MEC $S' \subseteq S$, we say that $(s, a) \in S \times A$ is maximal-external
in $S'$ if $s \in S'$, $a \in \mathit{Act}(s)$ is external and $V(s, a) \ge V(s', a')$ for all $s' \in S'$
and $a' \in \mathit{Act}(s')$. A state $s \in S'$ is an exit (of $S'$) if $(s, a)$ is maximal-external in $S'$ for some
$a \in \mathit{Act}(s)$. We add the following condition to the valid $\varepsilon$-underapproximation:

4. Each MEC $S' \subseteq S$ has at least one exit.

Now the definition of $\varsigma^V$ is also more complicated:

- For every $s \in S$ which is not in any MEC, we put $\varsigma^V(s) = \arg\max_{a \in \mathit{Act}(s)} V(s, a)$.

- For every $s \in S$ which is in a MEC $S'$,

• if $s$ is an exit, then $\varsigma^V(s) = \{a \in \mathit{Act}(s) \mid (s, a) \text{ is maximal-external in } S'\}$

• otherwise, $\varsigma^V(s) = \{a \in \mathit{Act}(s) \mid a \text{ is internal}\}$

Using these extended definitions, Lemma 1 remains valid. Further, note that $\varsigma^V(s)$ is defined
even for states with trivial underapproximation $V(s) = 0$, for instance a state $s$ that was
never subject to any value iteration improvement. Then the values $\varsigma^V(s)$ may not be stored
explicitly, but follow implicitly from not storing any $V(s)$, thus assuming $V(s, \cdot) = 0$.

4 Importance of Decisions 

Note that once we have computed an $\varepsilon$-optimal liberal strategy $\varsigma$, we may, in principle,
compute a compact representation of $\varsigma$ (using e.g. BDDs), and obtain a strategy with a
possibly smaller representation than above.

4 Intuitively this means that randomizing among good actions of $\varepsilon$-optimal strategies preserves
$\varepsilon$-optimality in the reachability setting (in contrast to other settings, e.g. with parity objectives).

However, we go one step further as follows. Given a liberal strategy $\varsigma$ and a state $s \in S$,
we define the importance of $s$ by

$$\mathit{Imp}^\varsigma(s) := \mathbb{P}^\varsigma[\Diamond s \mid \Diamond F],$$

the probability of visiting $s$ conditioned on reaching $F$ (afterwards). Intuitively, the importance
is high for states where a good decision can help to reach the target.⁵

Example 1. For the MDP of Fig. 1 with the objective $\Diamond\{t\}$ and strategy $\varsigma$ choosing $b$, we
have $\mathit{Imp}^\varsigma(s) = 1$ and $\mathit{Imp}^\varsigma(q) = 5/995$. Trivially, $\mathit{Imp}^\varsigma(t) = 1$. For all other states, the
importance is zero.
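
The importance can be computed on the Markov chain induced by the strategy. Because target
states are absorbing, a run that visits both $s$ and $F$ visits $s$ first, so the joint probability
factors into the probability of reaching $s$ times the probability of reaching $F$ from $s$. The
Python sketch below uses this observation. It is an illustration under assumptions: the function
names are hypothetical, the iteration only under-approximates reachability in general (it is exact
here because the example chain has no cycles other than absorbing self-loops), and the transition
probabilities are chosen so that the numbers agree with Example 1 rather than read off Fig. 1.

```python
from fractions import Fraction


def reach_prob(chain, targets, steps):
    """For every state, the probability of reaching `targets` within `steps` steps."""
    x = {u: Fraction(int(u in targets)) for u in chain}
    for _ in range(steps):
        x = {u: Fraction(1) if u in targets
             else sum((p * x[v] for v, p in chain[u].items()), Fraction(0))
             for u in chain}
    return x


def importance(chain, init, targets):
    """Imp(s) = P[visit s and reach targets] / P[reach targets], from state init."""
    n = len(chain)
    to_target = reach_prob(chain, targets, n)
    if to_target[init] == 0:
        raise ValueError("the target is not reached with positive probability")
    imp = {}
    for s in chain:
        visit_s = reach_prob(chain, {s}, n)[init]  # probability of ever visiting s
        imp[s] = visit_s * to_target[s] / to_target[init]
    return imp


# Assumed induced chain: s moves to t w.p. 99/100 and to q w.p. 1/100;
# q moves to t or to a sink v, each w.p. 1/2.
chain = {
    "s": {"t": Fraction(99, 100), "q": Fraction(1, 100)},
    "q": {"t": Fraction(1, 2), "v": Fraction(1, 2)},
    "v": {"v": Fraction(1)},
    "t": {"t": Fraction(1)},
}
imp = importance(chain, "s", {"t"})
```

The sink `v` is reachable but has importance zero, since no run through it reaches the target.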

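Obviously, decisions made in states of zero importance do not affect $\mathbb{P}^\varsigma[\Diamond F]$ since these
states never occur on paths from the initial state to $F$. However, note that many states of $S$ may be
reachable in $G^\varsigma$ with positive but negligibly small probability. Clearly, the value of $\mathbb{P}^\varsigma[\Diamond F]$
depends only marginally on choices made in these states. Formally, let $\varsigma_\Delta$ be a strategy
obtained from $\varsigma$ by changing each $\varsigma(s)$ with $\mathit{Imp}^\varsigma(s) < \Delta$ to an arbitrary subset of $\mathit{Act}(s)$.
We obtain the following obvious property:

Lemma 2. For every liberal strategy $\varsigma$, we have $\lim_{\Delta \to 0} \mathbb{P}^{\varsigma_\Delta}[\Diamond F] = \mathbb{P}^\varsigma[\Diamond F]$.

In fact, every $\Delta < \min(\{\mathit{Imp}^\varsigma(s) \mid s \in S\} \setminus \{0\})$ satisfies $\mathbb{P}^{\varsigma_\Delta}[\Diamond F] = \mathbb{P}^\varsigma[\Diamond F]$. But often
even larger $\Delta$ may give $\mathbb{P}^{\varsigma_\Delta}[\Diamond F]$ sufficiently close to $\mathbb{P}^\varsigma[\Diamond F]$. Such $\Delta$ may be found using
e.g. a trial and error approach.⁶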

Most importantly, we can use the importance of a state to affect the probability that
decisions in this state are indeed remembered in the data structure. Data structures with such
a feature are used in various learning algorithms. In the next section, we discuss decision
trees. Due to this extra variability in which decisions to learn, the resulting decision trees are
smaller than BDDs for strictly defined $\varsigma_\Delta$.

5 Efficient Representations 

Let $G = (S, A, \mathit{Act}, \delta, \hat{s})$ be an MDP. In order to symbolically represent strategies in $G$,
we need to assume that states and actions have some internal structure. Inspired by the PRISM
language [KNP11], we consider a set $\mathcal{V} = \{v_1, \ldots, v_n\}$ of integer variables, where each $v_i$ gets
its values from a finite domain $\mathit{Dom}(v_i)$. We suppose that $S = \prod_{i=1}^{n} \mathit{Dom}(v_i) \subseteq \mathbb{Z}^n$.
Further, we assume that the MDP arises as a product of $m$ modules, each of which can
separately perform non-synchronizing actions as well as synchronously with other modules
perform a synchronizing action. Therefore, we suppose $A \subseteq \bar{A} \times \{0, \ldots, m\}$, where $\bar{A} \subseteq \mathbb{N}$
is a finite set and the second component determines the module performing the action (0
stands for synchronizing actions).⁷

5 Instead of the conditional probability of reaching $s$, we could consider the conditional expected
number of visits of $s$. We discuss the differences and compare the efficiency together with the case
of no conditioning on reaching the target in Section 6.

6 One may even give a (quite conservative) bound on convergence of $\mathbb{P}^{\varsigma_\Delta}[\Diamond F]$ to $\mathbb{P}^\varsigma[\Diamond F]$ as $\Delta \to 0$,
using e.g. Lemma 5.1 of [BKK14]. However, for large MDPs the bound would be impractical.

7 On the one hand, PRISM does not allow different modules to have local variables with the same
name, hence we do not distinguish which module a variable belongs to. On the other hand,
while PRISM declares no names for non-synchronizing actions, we want to exploit the connection
between the corresponding actions of different copies of the same module.

Since a liberal strategy is a function of the form $\varsigma : S \to 2^A$, assigning to each state its
good actions, it can be explicitly represented as a list of state-action pairs, i.e., as a subset
of

$$S \times A \subseteq \prod_{i=1}^{n} \mathit{Dom}(v_i) \times \bar{A} \times \{0, 1, \ldots, m\} \qquad (1)$$

In addition, standard optimization algorithms implemented in PRISM use an explicit “don’t-care”
value $-2$ for the action in each unreachable state, meaning the strategy is not defined.
However, one could simply not list these pairs at all. Thus a smaller list is obtained, with only
the states where $\varsigma$ is defined. Recall that one may also omit states $s$ satisfying $\mathit{Imp}^\varsigma(s) = 0$,
thus ignoring reachable states with zero probability to reach the target. Further optimization
may be achieved by omitting states $s$ satisfying $\mathit{Imp}^\varsigma(s) < \Delta$ for a suitable $\Delta > 0$.
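
The pruned explicit list can be sketched as follows in Python. The encoding is an assumption
made for illustration: states are tuples of variable values, actions are pairs (label, module)
with module 0 standing for synchronizing actions, and `explicit_pairs` is a hypothetical helper,
not a PRISM function.

```python
def explicit_pairs(liberal, importance, threshold=0):
    """Return the sorted (state, action) pairs kept in the explicit list.

    A state is kept when its importance is positive and at least `threshold`.
    """
    pairs = []
    for state, good in liberal.items():
        imp = importance.get(state, 0)  # states never visited have importance 0
        if imp > 0 and imp >= threshold:
            pairs.extend((state, action) for action in good)
    return sorted(pairs)


# Hypothetical liberal strategy over two variables (v1, v2).
liberal = {
    (0, 0): {(1, 1), (2, 1)},
    (0, 1): {(1, 2)},
    (3, 1): {(4, 0)},
}
# State (3, 1) is missing from the map, i.e. it has zero importance.
importance = {(0, 0): 1.0, (0, 1): 0.005}

all_reached = explicit_pairs(liberal, importance)
above_one_percent = explicit_pairs(liberal, importance, threshold=0.01)
```

With the threshold 0.01, the rarely visited state `(0, 1)` is dropped along with the zero-importance state.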

5.1 BDD Representation 

The explicit set representation can be encoded as a binary decision diagram (BDD). This
has been used in e.g. [WBB+10, EJPV12]. The principle of the BDD representation of a set
is that (1) each element is encoded as a string of bits and (2) an automaton, in the form of
a binary directed acyclic graph, is created so that (3) the accepted language is exactly the
set of the given bit strings. Although BDDs are quite efficient, see Section 6, each of these
three steps can be significantly improved:

1. Instead of a string of bits describing all variables, a string of integers (one per variable) 
can be used. Branching is then done not on the value of each bit, but according to 
an inequality comparing the variable to a constant. This significantly improves the 
readability. 

2. Instead of building the automaton according to a chosen order of bits, we let a heuristic 
choose the order of the inequalities and the actual constants in the inequalities. 

3. Instead of representing the language precisely, we allow the heuristic to choose which
data to represent and which not. The likelihood that each datum is represented corresponds
to its importance, which we provide as another input.

The latter two steps lead to significantly smaller graphs than BDDs. All this can be done in
an efficient way using decision tree learning.

5.2 Representation using Decision Trees 

Decision trees. A decision tree for a domain $\prod_{i=1}^{d} X_i \subseteq \mathbb{Z}^d$ is a tuple $\mathcal{T} = (T, \rho, \theta)$ where
$T$ is a finite rooted binary (ordered) tree with a set of inner nodes $N$ and a set of leaves $L$,
$\rho$ assigns to every inner node a predicate of the form $[x_i \sim \mathit{const}]$ where $i \in \{1, \ldots, d\}$,
$x_i \in X_i$, $\mathit{const} \in \mathbb{Z}$, $\sim \in \{\le, <, \ge, >, =\}$, and $\theta$ assigns to every leaf a value good or
bad.⁸

Similarly to BDDs, the language $\mathcal{L}(\mathcal{T}) \subseteq \mathbb{N}^d$ of the tree is defined as follows. For a
vector $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_d) \in \mathbb{N}^d$, we find a path $p$ from the root to a leaf such that for each
inner node $n$ on the path, the predicate $\rho(n)$ is satisfied by the substitution $x_i = \bar{x}_i$ iff the first
child of $n$ is on $p$. Denote the leaf on this particular path by $\ell$. Then $\bar{x}$ is in the language
$\mathcal{L}(\mathcal{T})$ of $\mathcal{T}$ iff $\theta(\ell) = \mathit{good}$.

8 There exist many variants of decision trees in the literature allowing arbitrary branching, arbitrary
values in the leaves, etc., e.g. [Mit97]. For simplicity, we define only a special suitable subclass.

Example 2. Consider dimension $d = 1$, domain $X_1 = \{1, \ldots, 7\}$. A tree representing the set
$\{1, 2, 3, 7\}$ is depicted in Fig. 2. To depict the ordered tree clearly, we use unbroken lines
for the first child, corresponding to the satisfied predicate, and dashed lines for the second
one, corresponding to the unsatisfied predicate.


[Figure: decision tree drawing]

Fig. 2: A decision tree for $\{1, 2, 3, 7\} \subseteq \{1, \ldots, 7\}$
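
To make the definition of the language of a tree concrete, the following Python sketch encodes
a decision tree and tests membership by following the first child when the predicate holds and
the second child otherwise. The class and function names are assumptions, and the tree built at
the end is just one possible tree for the set of Example 2; it is not claimed to be the tree
drawn in Fig. 2.

```python
import operator
from dataclasses import dataclass
from typing import Union

# Comparison symbols allowed in predicates [x_i ~ const].
OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge,
       ">": operator.gt, "==": operator.eq}


@dataclass
class Leaf:
    good: bool          # True stands for "good", False for "bad"


@dataclass
class Node:
    index: int          # coordinate tested by the predicate (0-based)
    op: str             # one of the keys of OPS
    const: int
    first: "Tree"       # child followed when the predicate is satisfied
    second: "Tree"      # child followed otherwise


Tree = Union[Leaf, Node]


def in_language(tree, x):
    """Follow the unique root-to-leaf path determined by x and read the leaf."""
    while isinstance(tree, Node):
        holds = OPS[tree.op](x[tree.index], tree.const)
        tree = tree.first if holds else tree.second
    return tree.good


# One possible tree for {1, 2, 3, 7} over the domain {1, ..., 7}.
tree = Node(0, "<=", 3, Leaf(True), Node(0, ">=", 7, Leaf(True), Leaf(False)))
accepted = [v for v in range(1, 8) if in_language(tree, (v,))]
```

The same membership test applies unchanged to vectors with more coordinates, since each predicate names the coordinate it inspects.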


In our setting, we use the domain $S \times A$ defined by Equation (1), which is of the 
form $\prod_{i=1}^{n+2} X_i$ where for each $1 \le i \le n$ we have $X_i = Dom(v_i)$, $X_{n+1} = A$ and 
$X_{n+2} = \{0, 1, \dots, m\}$. Here the coordinates $Dom(v_i)$ are considered “unbounded” and, 
consequently, the respective predicates use inequalities. In contrast, we know the possible 
values of $A \times \{0, 1, \dots, m\}$ in advance and they are not too many. Therefore, these 
coordinates are considered “discrete” and the respective predicates use equality. Examples 
of such trees are given in Section 6 in Fig. 4 and 5. Now a decision tree $\mathcal{T}$ for this domain 
determines a liberal strategy $\varsigma : S \to 2^A$ by $a \in \varsigma(s)$ iff $(s, a) \in \mathcal{L}(\mathcal{T})$. 
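
As a small illustration of how a tree induces a liberal strategy, the following Python sketch collects the actions whose state-action pair lies in the language. The names `liberal_strategy` and `toy_language` are hypothetical, the membership test is passed in as a plain function, and an action is encoded as a single integer for brevity (the paper uses two discrete coordinates).

```python
def liberal_strategy(in_language, state, enabled_actions):
    """Return the set of actions a with (state, a) in the tree language."""
    return {a for a in enabled_actions if in_language(state + (a,))}

# Hypothetical toy language: action 1 is always good,
# action 0 is good iff the first state variable exceeds 2.
def toy_language(point):
    *valuation, action = point
    return action == 1 or valuation[0] > 2

allowed_low = liberal_strategy(toy_language, (1, 5), [0, 1])
allowed_high = liberal_strategy(toy_language, (4, 5), [0, 1])
```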

Learning. We describe the process of learning a training set, which can also be understood 
as storing the input data. Given a training sequence (repetitions allowed!) $x^1, \dots, x^k$, with 
each $x^i = (x^i_1, \dots, x^i_d) \in \mathbb{N}^d$, partitioned into the positive and negative subsequence, the 
process of learning according to the algorithm ID3 [Qui86,Mit97] proceeds as follows: 

1. Start with a single node (root), and assign to it the whole training sequence. 

2. Given a node $n$ with a sequence $\tau$, 

   - if all training examples in $\tau$ are positive, set $\theta(n) = good$ and stop; 
   - if all training examples in $\tau$ are negative, set $\theta(n) = bad$ and stop; 
   - otherwise, 
       • choose a predicate with the “highest gain” (with lowest entropy, see e.g. [Mit97, 
         Sections 3.4.1, 3.7.2]), 
       • split $\tau$ into sequences satisfying and not satisfying the predicate, assign them 
         to the first and the second child, respectively, 
       • go to step 2 for each child. 

Intuitively, the predicate with the highest gain splits the sequence so that it maximizes the 
portion of positive data in the satisfying subsequence and the portion of negative data in the 
non-satisfying subsequence. 
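
The two steps above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in the experiments: it only considers predicates of the form [x_i <= const] with constants taken from the data, it omits pruning, and the majority-label fallback for inseparable examples is an assumption that goes beyond the description above. All function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Entropy of the good/bad labels in a sequence of (vector, label) pairs."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def candidate_predicates(examples):
    """All predicates [x_i <= const] with const taken from the data."""
    dims = len(examples[0][0])
    return {(i, x[i]) for x, _ in examples for i in range(dims)}

def split(examples, pred):
    """Partition the sequence into satisfying and non-satisfying examples."""
    i, const = pred
    sat = [e for e in examples if e[0][i] <= const]
    unsat = [e for e in examples if e[0][i] > const]
    return sat, unsat

def gain(examples, pred):
    """Information gain: entropy before the split minus weighted entropy after."""
    sat, unsat = split(examples, pred)
    n = len(examples)
    after = len(sat) / n * entropy(sat) + len(unsat) / n * entropy(unsat)
    return entropy(examples) - after

def id3(examples):
    """Return 'good'/'bad' for a leaf, or (pred, first_child, second_child)."""
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()
    # Keep only predicates that put at least one example on each side.
    useful = [p for p in sorted(candidate_predicates(examples))
              if all(split(examples, p))]
    if not useful:
        # Assumption: identical vectors carrying both labels cannot be
        # separated, so fall back to the majority label.
        return Counter(l for _, l in examples).most_common(1)[0][0]
    best = max(useful, key=lambda p: gain(examples, p))
    sat, unsat = split(examples, best)
    return (best, id3(sat), id3(unsat))

def classify(tree, x):
    """Walk the learned tree; the first child corresponds to a satisfied predicate."""
    while not isinstance(tree, str):
        (i, const), first, second = tree
        tree = first if x[i] <= const else second
    return tree

# Training sequence for the set {1, 2, 3, 7} over the domain {1, ..., 7}.
training = [((v,), "good" if v in {1, 2, 3, 7} else "bad") for v in range(1, 8)]
tree = id3(training)
```

Repeating an example in the training sequence increases its weight in the entropy computation, which is how the importance of states enters the learning process later on.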

In addition, the final tree can be pruned. This means that some leaves are merged, 
resulting in a smaller tree at the cost of some imprecision of storing (the language of the 
tree changes). The pruning phase is quite sophisticated, hence for the sake of simplicity 
and brevity, we omit the details here. We use the standard C4.5 algorithm and refer 
to [Qui93,Mit97]. In Section 6 we comment on effects of parameters used in pruning. 

Learning a strategy. Assume that we already have a liberal strategy $\varsigma : S \to 2^A$. We 
show how we learn good and bad state-action pairs so that the language of the resulting tree 
is close to the set of good pairs. The training sequence will be composed of state-action 
pairs where good pairs are positive examples, and bad pairs are negative ones. Since our aim 
is to ensure that important states are learnt and not pruned away, we repeat pairs with more 
important states in the training sequence more frequently. 

Formally, for every $s \in S$ and $a \in Act(s)$, we put the pair $(s, a)$ into the training 
sequence $repeat(s)$ times, where 

    $repeat(s) = c \cdot Imp^{\varsigma}(s)$ 

for some constant $c \in \mathbb{N}$ (note that $Imp^{\varsigma}(s) \le 1$). Since we want to avoid exact computation 
of $Imp^{\varsigma}(s)$, we estimate it using simulations. In practice, we thus run $c$ simulation runs that 
reach the target and set $repeat(s)$ to be the number of runs in which $s$ was also reached. 
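
The simulation-based estimate can be sketched in Python as follows. The simulator interface (`simulate` returning a path and a flag telling whether the target was reached), the toy model and all names are assumptions made for illustration; they are not taken from the PRISM implementation.

```python
from collections import Counter
import random

def estimate_repeat(simulate, c, rng):
    """Over c runs that reach the target, count the runs visiting each state."""
    repeat = Counter()
    accepted = 0
    while accepted < c:
        path, reached_target = simulate(rng)
        if not reached_target:
            continue               # only runs that reach the target are counted
        accepted += 1
        repeat.update(set(path))   # a state counts at most once per run
    return repeat

def training_sequence(repeat, enabled, strategy):
    """Put every pair (s, a) into the sequence repeat(s) times, labelled good/bad."""
    seq = []
    for s, times in repeat.items():
        for a in enabled(s):
            label = "good" if a in strategy(s) else "bad"
            seq.extend([((s, a), label)] * times)
    return seq

# Hypothetical toy model: from state 0 the run fails (state 3) with
# probability 0.3, otherwise it passes through state 1 or 4 to the target 2.
def toy_simulate(rng):
    if rng.random() < 0.3:
        return [0, 3], False
    middle = 1 if rng.random() < 0.5 else 4
    return [0, middle, 2], True

repeat = estimate_repeat(toy_simulate, c=1000, rng=random.Random(0))
seq = training_sequence(repeat, lambda s: ["x", "y"], lambda s: {"x"})
```

In this toy model, states 1 and 4 each receive roughly half of the weight of states 0 and 2, while state 3, which never lies on a successful run, does not enter the training sequence at all.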


6 Experiments 

In this section, we present the experimental evaluation of the presented methods, which we 
have implemented within the probabilistic model checker PRISM [KNP11]. All the results 
presented in this section were obtained on a single Intel(R) Xeon(R) CPU (3.50GHz) with 
memory limited to 10GB. 

First, we discuss several alternative options to construct the training data and to learn 
them in a decision tree. Further, we compare decision trees to other data structures, namely 
sets and BDDs, with respect to the sizes necessary for storing a strategy. Finally, we 
illustrate how the decision trees can be used to gain insight into our benchmark examples. 


6.1 Generating Training Data 

The strategies we work with come from two different sources. Firstly, we consider strategies 
constructed by PRISM, which can be generated using the explicit or sparse model checking 
engine. Secondly, we consider strategies constructed by the BRTDP algorithm [BCC+14], 
which are defined on a part of the state space only. 

Recall that given a strategy, the training data for the decision trees is constructed from c 
simulation runs according to the strategy. In our experiments, we found that c = 10000 
produces good results in all the examples we consider. Note that we stop each simulation as 
soon as the target or a state with no path to the target state is reached. 


6.2 Decision Tree Learning in Weka 

The decision trees are constructed using the Weka machine learning package [HFH+09]. 
The Weka suite offers various decision tree classifiers. We use the J48 classifier, which is 
an implementation of the C4.5 algorithm [Qui93]. The J48 classifier offers two parameters 
to control the pruning that affect the size of the decision tree: 

- Firstly, the leaf size parameter M determines that each leaf node with fewer than M 
instances in the training data is merged with its siblings. The value M can be any 
positive integer. However, only values smaller than the minimum number of instances 
per classification class are reasonable, since higher numbers always result in the trivial 
tree of size 1. 
- The confidence factor C is used internally for determining the amount of pruning 
during decision tree construction. The value of C can be any double value in the 
half-open interval (0, 0.5]. Smaller values incur more pruning and therefore smaller trees. 

More information and an empirical study of the parameters for J48 can be found in [DM12]. 

Effects of the parameters. We illustrate the effects of the parameters C and M on the 
resulting size of the decision tree on the mer benchmark; similar behaviour 
appears in all the examples. Figures 3a and 3b show the resulting size of the decision tree. 
Each line in the plots corresponds to one decision tree, learned with 15 different values of 
the parameter. The C parameter scales linearly between 0.0001 and 0.5. The M parameter 
scales logarithmically between 1 and the minimum number of instances per class in the 
respective training set. The plots in Figure 3 show that M is an effective parameter in 
calibrating the resulting tree size, whereas C plays less of a role. Hence, we use C = 10^{-4}. 
Furthermore, since the tree size is monotone in M, the parameter M can be used to retrieve 
a desired level of detail from the tree. 



[Figure: three plots. (a) fixed M = 2, horizontal axis C; (b) fixed C = 10^{-4}; (c) Tree Size vs Error] 

Fig. 3: Decision tree parameters 


Figure 3c depicts the relation of the tree size to the relative error of the induced strategy. 
It shows that there is a threshold size under which the tree is not able to capture the strategy 
correctly anymore and the error rises quickly. Above the threshold size, the error is around 
1%, considered reasonable in order to extract reliable information. This threshold behaviour 
is observed in all our examples. Therefore, it is sensible to perform a binary search for the 
highest M ensuring the error at most 1% and we do so in the next section. 
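
A minimal Python sketch of such a binary search is given below. It assumes that the error does not decrease as M grows, in line with the threshold behaviour described above. The callback `error_for` is hypothetical: in an actual setup it would train a tree with leaf size M and evaluate the relative error of the induced strategy.

```python
def highest_m(error_for, m_max, max_error=0.01):
    """Largest M in [1, m_max] with error_for(M) <= max_error, or None."""
    lo, hi, best = 1, m_max, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if error_for(mid) <= max_error:
            best, lo = mid, mid + 1   # acceptable: try a larger M (more pruning)
        else:
            hi = mid - 1              # error too high: try a smaller M
    return best

# Hypothetical error curve with a threshold at M = 37.
def toy_error(m):
    return 0.005 if m <= 37 else 0.4

chosen = highest_m(toy_error, m_max=1000)
```

The search needs only logarithmically many tree constructions in the upper bound on M, which for the experiments is the minimum number of instances per class.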


6.3 Results 

First, we briefly introduce the four examples from the PRISM benchmark suite [KNP12], 
on which we tested our method: 

firewire models the Tree Identify Protocol of the IEEE 1394 High Performance Serial Bus, 
which is used to transport video and audio signals within a network of multimedia devices. 
The reachability objective is that one node gets the root and the other one the child role. 

investor models a market investor and shares, which change their value probabilistically 
over time. The reachability objective is that the investor finishes a trade at a time when his 
shares are more valuable than some threshold value. 

mer is a mutual exclusion protocol that regulates the access of two users to two different 
resources. The protocol should prohibit that both resources are accessed simultaneously. 

Table 1: Comparison of representation sizes for strategies obtained from PRISM and from 
BRTDP computation. Sizes are presented as the number of states for explicit lists of values, 
the number of nodes for BDDs, and the number of nodes for decision trees (DT). For DT, 
we display the tree size obtained from the binary search described above. Error reports the 
relative error of the strategy determined by the decision tree compared to the optimal value, 
obtained by model checking with PRISM. 

                                 PRISM                         BRTDP
Example    |S|            Value  Explicit  BDD    DT  Error    Explicit  BDD      DT  Error
firewire   481,136        1.000  479,834   4,233  1   0.000%   766       4,763^1  1   0.000%
investor   35,893         0.958  28,151    783    27  0.886%   21,931    2,780    35  0.836%
mer_17M    1,773,664      0.200  Memory Out                    1,887     619      17  0.000%
mer_big^2  Approx. 10^13  0.200  Memory Out                    1,894     646      19  0.692%
zeroconf   89,586         0.009  60,463    409    7   0.106%   1,630     905      7   0.235%

1 Note that BDDs represent states in binary form. Therefore, one entry in the explicit state list 
corresponds to several nodes in the BDD. 

2 We did not measure the state size as the MDP does not fit in memory, but extrapolated it from 
the linear dependence of model size and one of its parameters, which we could increase to 
2^31 - 1. The value is obtained from the BRTDP computation. 


zeroconf is a network protocol which allows users to choose their IP addresses autonomously. 
The protocol should detect and prohibit IP address conflicts. 

For every example, Table 1 shows the size of the state space, the value of the optimal strategy, 
and the sizes of strategies obtained from explicit model checking by PRISM and by BRTDP, for 
each discussed data structure. 

Learning variants. In order to justify our choice of the importance function Imp, we compare 
it to several alternatives. 

1. When constructing the training data, we can use the importance measure Imp, and add 
states as often as is indicated by their importance (I), or neglect it and simply add every 
visited state exactly once (O). 

2. Further, states on the simulation are learned conditioned on the fact that the target state 
is reached (◇). Another option is to consider all simulations (∀). 

3. Finally, instead of the probability to visit the state (P), one can consider the expected 
number of visits (E). 

In Table 2 we report the sizes of the decision trees obtained for all the learning variants. 
We conclude that our choice (I◇P) is the most useful one. 

Table 2: Effects of various learning variants on the tree size. Smallest trees computed from 
PRISM or BRTDP are presented. 

Example    I◇P  I∀P  I◇E  I∀E  O◇   O∀
firewire   1    1    1    1    1    1
investor   27   25   31   35   37   37
mer_17M    17   33   17   29   19   none
mer_big    19   23   23   37   17   none
zeroconf   7    7    7    7    7    17



6.4 Understanding Decision Trees 

We show how the constructed decision trees can help us to gain insight into the essential 
features of the systems. 

zeroconf example. In Figure 4 we present a decision tree that is a strategy for 
zeroconf and shows how an unresolved IP address conflict can occur in the protocol. 
First we present how to read the strategy represented in Figure 4. Next we show 
how the strategy can explain the conflict in the protocol. Assume that we are classifying a 
state-action pair (s, a), where action a is enabled in state s. 

1. No matter what the current state s is, the action rec is always classified as bad 
according to the root of the tree. Therefore, the action rec should be played with 
positive probability only if all other available actions in the current state are also 
classified as bad. 

2. If action a is different from rec, the right son of the root node is 
reached. If action a is different from action l>0 & b=1 & ip_mess=1 -> 
b'=0 & z'=0 & n1'=min(n1+1,8) & ip_mess'=0 (the whole PRISM command is 
a single action), then a is classified as good in state s. Otherwise, the left son is reached. 

3. In node z ≤ 0 the classification of action a (that is, the action that labels the parent 
node) depends on the variable valuation of the current state. If the value of variable z is 
greater than 0, then a is classified as good in state s, otherwise it is classified as bad. 

Action rec stands for a network host receiving a reply to a broadcast message, resulting 
in resolution of an IP address conflict if one is present, which clearly does not help in 
constructing an unresolved conflict. The action labelling the right son of the root represents 
the detection of an IP address conflict by an arbitrary network host. This action is only 
good if variable z, which is a clock variable, in the current state is greater than 0. The 
combined meaning of the two nodes is that an unresolved IP address conflict can occur if 
the conflict is detected too late. 
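
The three reading steps can also be transcribed into a short Python function, which may help to see that the tree is nothing more than a nested case distinction. The function name and the label `DETECT`, which stands for the long PRISM command from step 2, are illustrative placeholders, not identifiers from the model.

```python
# Hypothetical short label for the PRISM command quoted in step 2.
DETECT = "detect"

def classify_zeroconf(z, action):
    """Classify a state-action pair; only variable z of the state is inspected."""
    if action == "rec":
        return "bad"                     # step 1: decided at the root
    if action != DETECT:
        return "good"                    # step 2: any other action
    return "good" if z > 0 else "bad"    # step 3: node testing z
```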



firewire example. For firewire, we obtain a trivial tree with a single node, labelled 
good. Therefore, playing all available actions in each state guarantees reaching the target 
almost surely. In contrast to other representations, we have automatically obtained the 
information that the network always reaches the target configuration, regardless of the individual 
behaviour of its components and their interleaving. 

mer example. In the case of mer, there exists a strategy that violates the required property 
that the two resources are not accessed simultaneously. The decision tree for the mer 
strategy is depicted in Figure 5. In order to understand how a state is reached in which both 
resources are accessed at the same time, it is necessary to determine which user accesses 
which resource in that state. 

1. The two tree nodes labelled by 1 explain what resource user 1 should access. The root 
node labelled by action s1=0 & r1=0 -> r1'=2 specifies that the request to access 
resource 2 (variable r1 is set to 2) is classified as bad. The only remaining action for 
user 1 is to request access to resource 1. This action is classified as good by the right 
son of the root node. 

2. Analogously, the tree nodes labelled by 2 specify that user 2 should request 
access to resource 2 (follows from s2=0 & r2=0 -> r2'=2). Once resource 2 is 
requested, the user should change its internal state s2 to 1 (follows from s2=0 & r2=2 -> 
s2'=1). It follows that in the state violating the property, user 1 has access to resource 
1 and user 2 to resource 2. 

The model is supposed to correctly handle such overlapping requests, but fails to do 
so in a specific case. In order to further debug the model, one has to find the action of the 
scheduler that causes this undesired behaviour. The lower part of the tree specifies that 
u1_request_comm is a candidate for such an action. Inspecting a snippet of the code of 
u1_request_comm from the PRISM source code (shown below) reveals that in the given 
situation, the scheduler reacts inappropriately with some probability p. 

[u1_request_comm] s=0 & commUser=0 & driveUser!=0 & k<n -> 
    (1-p) : (s'=1) & (r'=driveUser) & (k'=k+1) + 
    p : (s'=-1) & (gc'=true) & (k'=k+1); 

The remaining nodes of the tree that were not discussed are necessary to reset the 
situation if the non-faulty part (with probability 1 - p) of the u1_request_comm command 
was executed. It should be noted that executing the faulty u1_request_comm action does 
not lead to the undesired state right away. The action only grants user 1 access rights in 
a situation where he should not get these rights. Only a successive action leads to user 1 
accessing the resource and the undesired state being reached. This is a common type of bug, 
where the command at which the error becomes visible is not the one that caused it. 


Fig. 5: A decision tree for mer 


7 Conclusion 

In this work we presented a new approach to represent strategies in MDPs in a succinct 
and comprehensible way. We exploited machine learning methods to achieve our goals. 
Interesting directions of future work are to investigate whether other machine learning 
methods can be integrated with our approach, and to extend our approach from reachability 
objectives to other objectives (such as long-run average and discounted-sum). 